Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broken title search ut #411

Merged
merged 15 commits into from
Sep 16, 2024
Merged

Broken title search ut #411

merged 15 commits into from
Sep 16, 2024

Conversation

nadolskit
Copy link
Contributor

Summary of Problem
The expected vs asserted citation count was not matching, causing this UT to specifically fail. 

I noticed if I switched the order of sources from semantic_scholar, crossref to crossref, semantic_scholar the citation count would be even more dramatically different (similar to the results James was noticing earlier)

More importantly, I found the papers on semantic scholar (25); which doesn't match our expected value (23); so that failure makes sense.
The following assertion is failing for the same reason: 196 assertion vs 191 expected.

Vibe checked the results of this change via a CI run: https://github.com/Future-House/paper-qa/actions/runs/10859898582/job/30139826477

This does not fix the issue that James found earlier this evening in test_agents.py.

@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. bug Something isn't working labels Sep 14, 2024
tests/utils/paper_helpers.py Outdated Show resolved Hide resolved
Comment on lines 12 to 16
citation_pattern = r"(This article has )\d+( citations?)"

# between group 1 and 2, replace with the character "n"
expected_cleaned = re.sub(citation_pattern, r"\1n\2", expected).strip()
actual_cleaned = re.sub(citation_pattern, r"\1n\2", actual).strip()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you got the capturing groups wrong, you want r"This article has (\d+) citations?"

We don't care to extract the text, we want to extract the number

Then you can just directly compare the int, no need for the weird r"\1n\2"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair.
This original approach using \1n\2 was because it's focusing on comparing the structure of the citation text while ignoring the citation count itself; just ignoring whatever number is there.

But you make a valid point. I'll get this updated!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fwiw (I am not sure your experience levels with regex) going from r"(This article has )\d+( citations?)" to r"This article has (\d+) citations?" will still match the surrounding text too. A capturing group () is here to make it easy to capture and extract a value

And plz point out if we're on the same page already haha. Hope you have a good night

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe I've addressed this accordingly. Thx James!

:return: True if the citations match except for the citation count, False otherwise.
"""
# https://regex101.com/r/lCN8ET/1
citation_pattern = r"(This article has )\d+( citations?)"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This regex is closely coupled to DocDetails. Can you move this regex to a ClassVar[str] on DocDetails?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Sep 14, 2024
@mskarlin
Copy link
Collaborator

The minor variations, i.e. 196 vs 191 is expected, because the papers will get more citations over time. The bigger question is why the cassettes aren’t being used because this should come back with a recorded request each time, I’m in favor of root causing that before we merge this. Maybe we regenerated these cassettes but didn’t update the test?

Comment on lines 128 to 133
assert (
expected_citation_format is not None
), "Expected string should match the citation pattern"
assert (
actual_citation_format is not None
), "Actual string should match the citation pattern"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert (
expected_citation_format is not None
), "Expected string should match the citation pattern"
assert (
actual_citation_format is not None
), "Actual string should match the citation pattern"
assert (
expected_citation_format
), "Expected string should match the citation pattern"
assert (
actual_citation_format
), "Actual string should match the citation pattern"

This may resolve the union-attr ignore a few lines below

Comment on lines 135 to 148
expected_remaining = (
paper_attributes["formatted_citation"][: expected_citation_format.start()]
+ paper_attributes["formatted_citation"][expected_citation_format.end() :]
)

actual_remaining = (
details.formatted_citation[: actual_citation_format.start()] # type: ignore[union-attr]
+ details.formatted_citation[actual_citation_format.end() :] # type: ignore[union-attr]
)

# Assert that the parts of the strings outside the citation count are identical
assert (
expected_remaining == actual_remaining
), "Formatted citation text should match except for citation count"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we are doing a regex search to find a region, then removing that region, and lastly doing an equality check on the front/back end.

I think it would be simpler to do one regex removal of the region, and directly equality compare the remanants.

I also think this is what you originally had 😅 😓 hahaha sorry! I actually realized I misunderstood your original logic last night, my bad here.

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

definitely! no worries, communicating through PRs can be hard sometimes. Thank you!! adjusting now.

before I merge this I'm also going to look into why the cassette file aren't (hasn't) been updated

tests/test_clients.py Outdated Show resolved Hide resolved
@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Sep 14, 2024
@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Sep 16, 2024
@dosubot dosubot bot removed the size:XL This PR changes 500-999 lines, ignoring generated files. label Sep 16, 2024
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Sep 16, 2024
Copy link
Collaborator

@jamesbraza jamesbraza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why some requests were missing from fixtures, but otherwise LGTM and thanks for doing this

paperqa/types.py Outdated Show resolved Hide resolved
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Sep 16, 2024
Co-authored-by: James Braza <[email protected]>
@nadolskit nadolskit merged commit 2f7e462 into main Sep 16, 2024
5 checks passed
@nadolskit nadolskit deleted the broken-title-search-UT branch September 16, 2024 20:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working lgtm This PR has been approved by a maintainer size:L This PR changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants